Chronic Diseases Data Science Project

Dimensionality reduction in Python

Code
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns

Introduction

In data science work on chronic diseases, one of the most challenging tasks is dealing with high-dimensional datasets. Dimensionality reduction techniques enable us to extract valuable insights from complex, high-dimensional data while reducing the computational burden. In this coding section of our chronic diseases data science project, we employ dimensionality reduction methods to strengthen our analyses, uncover hidden patterns, and contribute to the development of effective strategies for disease prevention. These techniques help us better understand and address critical public health concerns. For this part, I will be using the Framingham Heart Study data set.

Data preparation

Code
# Load the Framingham Heart Study data set
data = pd.read_csv("data/frmgham2.csv")
data.head()
RANDID SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES ... CVD HYPERTEN TIMEAP TIMEMI TIMEMIFC TIMECHD TIMESTRK TIMECVD TIMEDTH TIMEHYP
0 2448 1 195.0 39 106.0 70.0 0 0.0 26.97 0 ... 1 0 8766 6438 6438 6438 8766 6438 8766 8766
1 2448 1 209.0 52 121.0 66.0 0 0.0 NaN 0 ... 1 0 8766 6438 6438 6438 8766 6438 8766 8766
2 6238 2 250.0 46 121.0 81.0 0 0.0 28.73 0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766
3 6238 2 260.0 52 105.0 69.5 0 0.0 29.43 0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766
4 6238 2 237.0 58 108.0 66.0 0 0.0 28.50 0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766

5 rows × 39 columns

Missing data handling and normalization

Code
# Check for missing data and calculate percentages
missing_data = data.isnull().mean()

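# Drop columns with more than 50% missing data
threshold = 0.5
columns_to_drop = missing_data[missing_data > threshold].index
data = data.drop(columns=columns_to_drop)

# Drop the first column (RANDID, the participant identifier), which is not a feature
data = data.iloc[:, 1:]

# Remove rows with missing data, as the remaining missing percentages are low.
# Reassigning (rather than inplace=True) avoids modifying a slice of the original frame.
data = data.dropna()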

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
data.head()
SEX TOTCHOL AGE SYSBP DIABP CURSMOKE CIGPDAY BMI DIABETES BPMEDS ... CVD HYPERTEN TIMEAP TIMEMI TIMEMIFC TIMECHD TIMESTRK TIMECVD TIMEDTH TIMEHYP
0 1 195.0 39 106.0 70.0 0 0.0 26.97 0 0.0 ... 1 0 8766 6438 6438 6438 8766 6438 8766 8766
2 2 250.0 46 121.0 81.0 0 0.0 28.73 0 0.0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766
3 2 260.0 52 105.0 69.5 0 0.0 29.43 0 0.0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766
4 2 237.0 58 108.0 66.0 0 0.0 28.50 0 0.0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766
5 1 245.0 48 127.5 80.0 1 20.0 25.34 0 0.0 ... 0 0 8766 8766 8766 8766 8766 8766 8766 8766

5 rows × 36 columns
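
The preparation steps above can be illustrated on a small scale. The following is a minimal sketch, assuming a hypothetical three-column toy frame (`toy`, with made-up values, not the Framingham data), of how the 50% missing-data threshold, the row-wise `dropna`, and `StandardScaler` interact:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame (not the Framingham data): column "c" is mostly missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
    "c": [np.nan, np.nan, np.nan, 5.0],
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Keep columns at or below the 50% threshold, then drop rows that still contain NaN
kept = toy.loc[:, missing_fraction <= 0.5].dropna()

# Standardize: each remaining column ends up with mean 0 and unit (population) variance
scaled = StandardScaler().fit_transform(kept)

print(missing_fraction.to_dict())
print(kept.shape)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Note that `StandardScaler` returns a NumPy array, so the column names live only in the DataFrame (`kept` here, `data` in the project code).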

PCA method

PCA results plots

Code
# Apply PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)

# Plot the PCA results
sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1])

plt.title('PCA Results')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')

plt.figure()
sns.barplot(x=[f"PC{i+1}" for i in range(pca.n_components_)], y=pca.explained_variance_ratio_)
plt.title('Explained Variance by PCA Components')
plt.ylabel('Explained Variance Ratio')

plt.show()

PCA variance table for the first two principal components

Code
# Create a DataFrame with the explained variance and cumulative variance
explained_variance = pca.explained_variance_ratio_
cumulative_variance = pca.explained_variance_ratio_.cumsum()

pca_table = pd.DataFrame({'Principal Component': [f"PC{i+1}" for i in range(len(explained_variance))],
                          'Explained Variance': explained_variance,
                          'Cumulative Variance': cumulative_variance})

print(pca_table)
  Principal Component  Explained Variance  Cumulative Variance
0                 PC1            0.256414             0.256414
1                 PC2            0.105013             0.361428

PCA Biplot

Code
# Get the loadings
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Create a new matplotlib figure and axis
fig, ax = plt.subplots(figsize=(10, 7))

# Plot the loadings for each feature as arrows (biplot code written with some assistance from ChatGPT)
for i, (loading1, loading2) in enumerate(loadings):
    ax.arrow(0, 0, loading1, loading2, head_width=0.05, head_length=0.1, length_includes_head=True, color='red')
    # Place each feature name slightly beyond the tip of its arrow
    ax.text(loading1 * 1.2, loading2 * 1.2, data.columns[i], color='black', ha='center', va='center')

# Set plot labels and title
ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component')
ax.set_title('PCA Biplot')
ax.axhline(0, color='grey', lw=1)
ax.axvline(0, color='grey', lw=1)
ax.grid(True)

# Show the plot
plt.show()
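
One way to read the arrows: for standardized inputs, the loadings computed as `pca.components_.T * np.sqrt(pca.explained_variance_)` are, up to a small sample-size factor, the correlations between each original variable and each principal component score. The sketch below checks this numerically. It assumes synthetic data (the array `X` and its four features are made up for illustration) rather than the Framingham set:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 rows, 4 features, the first two strongly correlated
n = 500
base = rng.normal(size=(n, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(n, 1)),
    base + 0.3 * rng.normal(size=(n, 1)),
    rng.normal(size=(n, 2)),
])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# Loadings, computed the same way as in the biplot code
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Pearson correlation between every feature and every PC score
corr = np.array([
    [np.corrcoef(X_scaled[:, j], scores[:, k])[0, 1] for k in range(2)]
    for j in range(X_scaled.shape[1])
])

# The two agree up to a factor sqrt(n / (n - 1)): StandardScaler divides by n,
# while PCA's explained_variance_ divides by n - 1.
max_gap = np.abs(loadings - corr * np.sqrt(n / (n - 1))).max()
print(max_gap)
```

Under this reading, an arrow that lies close to an axis and is long marks a variable that is strongly correlated with that component.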

Interpretation

  • The PCA results show that the first two principal components (PC1 and PC2) account for a moderate share of the variance in the dataset. PC1 explains 25.64% of the variance and PC2 explains 10.50%, so together they capture 36.14% of the total; roughly 64% of the variance is not represented in the two-dimensional projection.

  • The bar chart and the table show a sharp drop in explained variance from PC1 (25.64%) to PC2 (10.50%), which is typical in PCA. PC1 captures the largest share of the variance, but PC2 still represents a meaningful amount of variation. Because only two components were fitted, these results do not show how much variance the later components explain.

  • Depending on the goal of the analysis, reducing the data to these two components may be useful, for example for visualization. Since they retain only about 36% of the variance, downstream modeling may need more components.

  • The PCA biplot shows that variables such as DIABP (diastolic blood pressure), AGE, HYPERTEN (hypertension), SYSBP (systolic blood pressure), and PREVHYP (prevalent hypertension) have a pronounced influence on the first principal component. Conversely, variables like CURSMOKE (current smoker status), CIGPDAY (cigarettes per day), and TIMEHYP (time to hypertension development) have a strong influence on the second principal component. This separation can guide feature engineering. PCA reduces the complexity of the data and makes the key features easier to interpret.
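
Whether two components are enough depends on how the variance is spread over all components, which a two-component fit cannot show. The following is a minimal sketch of how the full variance spectrum could be inspected. It assumes synthetic data (10 made-up features driven by 3 latent factors) and a hypothetical 80% target, neither of which comes from the project:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the scaled data: 300 rows, 10 features
# generated from 3 latent factors plus noise
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 10))
X_scaled = StandardScaler().fit_transform(X)

# n_components=None keeps every component, giving the full scree curve
pca_full = PCA(n_components=None).fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
target = 0.80  # hypothetical threshold, chosen for illustration only
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative.round(3))
print(n_needed)
```

scikit-learn can also do this selection directly: passing a float between 0 and 1 as `n_components` keeps just enough components to reach that share of the variance.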

t-SNE Method

Code
# t-SNE analysis with different perplexity values
ps = [5, 30, 50, 100]

for p in ps:
    tsne = TSNE(n_components=2, perplexity=p, random_state=42)
    tsne_result = tsne.fit_transform(data_scaled)
    tsne_df = pd.DataFrame(tsne_result, columns=['TSNE1', 'TSNE2'])

    sns.scatterplot(data=tsne_df, x='TSNE1', y='TSNE2')
    plt.title(f't-SNE with Perplexity {p}')
    plt.xlabel('Component 1')
    plt.ylabel('Component 2')
    plt.show()

Interpretation

  • With perplexity 5, the plot shows numerous small clusters, indicating that the model is capturing more local structure in the data.

  • Perplexity 30 presents fewer, larger clusters, suggesting a more balanced view that incorporates broader data relationships.

  • At perplexity 50, the clusters are less distinct but still separate, which might mean the model is starting to prioritize the global data structure over local nuances.

  • Finally, perplexity 100 leads to even more overlap between clusters, suggesting that the embedding now reflects mainly the broader patterns in the data.
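
Visual comparison of the plots can be backed by a number. As a rough sketch, scikit-learn's `trustworthiness` score measures how well an embedding preserves local neighbourhoods, so it can be computed for each perplexity. The example assumes synthetic blob data and a neighbourhood size of 10, both chosen only for illustration, and uses a small sample so that it runs quickly:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 points in 10 dimensions around 4 centers
X, _ = make_blobs(n_samples=200, n_features=10, centers=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for p in [5, 30, 50]:  # perplexity must stay below the number of samples
    embedding = TSNE(n_components=2, perplexity=p, random_state=42).fit_transform(X_scaled)
    # Trustworthiness lies in [0, 1]; higher means the 2-D neighbours of a point
    # were also its neighbours in the original space
    scores[p] = trustworthiness(X_scaled, embedding, n_neighbors=10)

for p, score in scores.items():
    print(f"perplexity={p}: trustworthiness={score:.3f}")
```

A score like this describes local structure only; distances between well-separated clusters in a t-SNE plot still need to be read with caution.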

Conclusions

  • PCA and t-SNE are both powerful techniques for dimensionality reduction, each with distinct characteristics suitable for different types of data analysis.

  • PCA is a linear technique that reduces dimensions by rotating the data into a new coordinate system whose axes, the principal components, are ordered by the variance they capture: the first component lies along the direction of greatest variance, the second along the next greatest, and so on. It preserves global structure well and is computationally efficient, making it suitable for datasets where linear relationships are dominant.

  • t-SNE, on the other hand, is a non-linear technique that is good at revealing local structure and clusters within the data. Unlike PCA, t-SNE can capture non-linear relationships by mapping the high-dimensional data to a lower-dimensional space in a way that preserves the data’s local neighborhood structure. This makes t-SNE particularly useful for exploratory data analysis and for datasets where the underlying structure is non-linear.
